Patrick Cao
Insider trading, or trading financial assets using information that is not open to the public, is punishable by law in the United States. It is a serious violation -- a maximum penalty of 20 years in prison, and up to $20 million in fines. The Securities and Exchange Commission (SEC) is responsible for identifying and conducting investigations regarding insider trading.
With that being said, U.S. senators (and other congresspeople) often have access to information about upcoming policies or laws before the general public, yet are still allowed to trade securities. Senators, however, must disclose their trades publicly. More recently, a few senators have been investigated by the SEC for insider trading using nonpublic knowledge given to them about the COVID-19 pandemic. With the rise in popularity of investing, senator stock trading patterns has become to the public's attention. You may have seen posts on the internet indicating the timing of senators trades -- right before major crashes or rises -- is very good, implying that there is rampant insider trading going on in the senate. For example, take this post on Reddit that was upvoted over 30,000 times: https://www.reddit.com/r/dataisbeautiful/comments/gjlvnd/.
Is insider trading in the senate as rampant as this post suggests? This tutorial aims to identify potential insider trading patterns by the U.S. senate. In order to answer this question, we will be using data from https://senatestockwatcher.com/, which itself compiles data from https://efdsearch.senate.gov/. efdsearch.senate.gov does not have an official API, and scraping each page would be extremely tedious. Senate stock watcher aggregates financial disclosures from all senators into one nice dataset.
Let's first start by downloading the dataset. While Senate Stock Watcher provides an API for up to date datasets, we can just download a csv file as well. Download the "All Transactions" csv file from https://senatestockwatcher.com/api.html.
Let's load that into a pandas dataframe.
import pandas as pd
import numpy as np
import datetime
df = pd.read_csv('all_transactions.csv')
df.head()
Let's clean this up a little by limiting and purchases and sales only, and dropping unnecessary columns.
start_date = '2016-01-01'
end_date = '2020-11-01'
# Turn transaction_date from string into datetime
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
# Remove (Full) and (Partial) after Sale type
df['type'] = df['type'].apply(lambda x: str(x).split()[0])
# Limit to only purchases and sales
df = df[(df['type'] == 'Purchase') | (df['type'] == 'Sale')]
# limit the dataset to a date range
df = df[(df['transaction_date'] > start_date) & (df['transaction_date'] < end_date)]
df = df.drop(['asset_description', 'comment', 'ptr_link'], axis=1)
df.head()
We got our senator stock trading data, but now we need actual historical market data to visualize these trades. Let's use yfinance, a python Yahoo! Finance library, to get historical market data on SPY, an ETF tracking the S&P500. This will allow us to easily pull historical price data of any stock from https://finance.yahoo.com/
import yfinance as yf
spy = yf.Ticker('SPY')
spy_hist = spy.history(period='max')
spy_hist.tail()
Neat! Now we can start putting these two things together.
import matplotlib.pyplot as plt
# number of trades by senators
num_trades = df.groupby('senator').count().sort_values(by='transaction_date', ascending=False).head(10)
plt.title('Top 10 Most Active Senators')
plt.xlabel('Total Number of Trades')
plt.ylabel('Senator')
plt.barh(num_trades.index, num_trades['transaction_date'])
plt.show()
# number of trades by ticker
trades_by_ticker = df[df['ticker'] != '--'].groupby('ticker').count().sort_values(by='transaction_date', ascending=False).head(20)
plt.title('Top 20 Most Traded Stocks')
plt.xlabel('Total Number of Trades')
plt.ylabel('Ticker')
plt.barh(trades_by_ticker.index, trades_by_ticker['transaction_date'])
plt.show()
# values of trades
trade_val = df.groupby('amount').count().sort_values(by='transaction_date', ascending=False)
plt.title('Value of Trades')
plt.xlabel('Number of Trades')
plt.ylabel('Value of Trade ($)')
plt.barh(trade_val.index, trade_val['transaction_date'])
plt.show()
It looks like most trades by senators are small -- under $15,000 per trade -- and they like trading AAPL, which is coincidentally the largest market cap company in the world. There's a good mix of tech, pharma, communications, and retail stocks that are the most commonly traded in the senate. There's not too much insight we can pull from this, but it is interesting.
Let's now try to visualize senator trading activity over time. We can also plot overall senator trading activity against the S&P500 for comparison. We can use the value of each trade to quantify senator trading activity.
It's a little tough to track performance exactly. The value of each trade isn't an exact number -- it's a range. Let's check out what kinds of values we're working with here.
df['amount'].unique()
Yikes. The data doesn't have the exact value of trades, and the granularity of each bucket is pretty large. The next best thing we can probably do is estimate the size of each trade by just taking the middle value of each trade, and just count the number of buys and sells each day. Thus, we have 2 differnet metrics to represent senator trading activity: number of buys and sells per day, and value of all trades per day.
# Returns the average of the given bound
def getBounds(row):
amount = row[0]
trade_type = row[1]
# This amount shows up once in the entire dataset. Let's just assume the trade is $50m for now. It won't affect our analysis too much.
if amount == 'Over $50,000,000':
return 5e7
split = amount.split(' - ')
lower = int(split[0][1:].replace(',', ''))
upper = int(split[1][1:].replace(',', ''))
mid = lower + ((upper - lower) / 2)
if trade_type == 'Purchase':
return mid
return -mid
df['est_amt'] = pd.Series(df[['amount', 'type']].apply(getBounds, axis=1))
df
Alright, now we got some rough estimates of the value of each transaction, we can group trades together by date and find the aggregate for each day.
aggregate_action_with_type = df.groupby([pd.Grouper(key='transaction_date', freq='D'), 'asset_type', 'type']).agg(est_amt=('est_amt', 'sum'), num_trades=('est_amt', 'count')).reset_index()
# because a sale is the opposite of a purchase, we want to show that
aggregate_action_by_asset = aggregate_action_with_type.copy()
aggregate_action_by_asset['num_purchases'] = aggregate_action_by_asset.apply(lambda row: -row['num_trades'] if row['type'] == 'Sale' else row['num_trades'], axis=1)
aggregate_action_by_asset = aggregate_action_by_asset.groupby(['transaction_date', 'asset_type']).sum().reset_index()
aggregate_action = aggregate_action_by_asset.groupby('transaction_date').sum().reset_index()
print(aggregate_action_by_asset.head())
print(aggregate_action.head())
We are now ready to plot senator trades against the market.
spy_hist['Date'] = spy_hist.index
# since we've limited the senator trades to a date range, we want to also limit SPY history
# to the same date range
market = spy_hist[(spy_hist['Date'] > start_date) & (spy_hist['Date'] < end_date)]
def plot_trades_vs_spy(trades, asset_type, metric, y_label):
plt.figure(figsize=(18,6))
plt.plot(market['Date'], market['Close'])
plt.xlabel('Year')
plt.ylabel('SPY price ($)')
ax2 = plt.twinx()
# scale bar graph limits s.t. 0 is centered in graph
limit = max(abs(trades[metric].min()), trades[metric].max()) * 1.1
ax2.set_ylim(-limit, limit)
ax2.spines['bottom'].set_position(('data', 0))
plt.ylabel(y_label)
plt.bar(trades['transaction_date'], trades[metric], width=5, color='orange')
plt.title('Senator Trades of ' + str(asset_type) + ' (' + metric + ')' + ' vs. SPY price')
plt.show()
for asset_type in aggregate_action_by_asset['asset_type'].unique():
trades_by_asset = aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == asset_type]
plot_trades_vs_spy(trades_by_asset, asset_type, 'num_purchases', 'Number of Purchases/Day')
plot_trades_vs_spy(trades_by_asset, asset_type, 'est_amt', 'Net Securities Traded/Day ($)')
plot_trades_vs_spy(aggregate_action, 'Overall', 'num_purchases', 'Number of Purchases/Day')
plot_trades_vs_spy(aggregate_action, 'Overall', 'est_amt', 'Net Securities Traded/Day ($)')
Is there a relationship between these trades and future SPY price? These plots look interesting, but because we plot both senate trading activity and S&P500 performance over time, but we can't really tell if there is a consistent pattern here. It would be helpful to instead plot this data on a scatterplot. Let's plot these trades vs. future SPY price as well. To find the price of SPY a number of days in the future, we need to know all the open market days. Luckily, there is a library for that.
But how long in the future should we check? It's probably not feasible to check a lot of different number of days in the future, so we have to pick this value somewhat arbitrarily. Let's just pick 1 week and 3 weeks. There are 5 trading days per week, so when looking at future trading days, we could look at SPY prices 5 and 15 days in the future.
import pandas_market_calendars as mcal
from datetime import datetime
nyse = mcal.get_calendar('NYSE')
market_open_days = nyse.valid_days(start_date=start_date, end_date=datetime.today().strftime('%Y-%m-%d'))
# returns the spy price num_days from today
def get_future_spy_price(num_days, today):
# calculate current spy data
today_index = market_open_days.get_loc(today, method='nearest')
today_close = spy_hist.iloc[spy_hist.index.get_loc(today, method='nearest')]['Close']
# calculate future date and get spy data
future_index = today_index + num_days
future_date = market_open_days[future_index]
future_close = spy_hist.iloc[spy_hist.index.get_loc(future_date, method='nearest')]['Close']
return 100 * (future_close - today_close) / today_close
aggregate_action_by_asset['5_day_spy_pct_change'] = aggregate_action_by_asset['transaction_date'].apply(lambda date: get_future_spy_price(5, date))
aggregate_action_by_asset['15_day_spy_pct_change'] = aggregate_action_by_asset['transaction_date'].apply(lambda date: get_future_spy_price(15, date))
aggregate_action['5_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(lambda date: get_future_spy_price(5, date))
aggregate_action['15_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(lambda date: get_future_spy_price(15, date))
aggregate_action
Now that we have our data, we can begin plotting. We want to plot value of trades vs. future value, and number of purchases vs. future value for both 5 and 15 days in the future.
Let's now also use some machine learning techniques to try and uncover patterns in each senator's trade. Since we suspect there might be a relationship between the future value of a stock and a senator's trades (quantified by value or number of purchases), which are both quantitative variables, we can try to fit a linear regression line to each plot.
import statsmodels.api as sm
def set_subplot(ax, x, y, title, x_label, y_label):
# regression line
X = np.array(x).reshape(len(x), 1)
# add intercept term
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
label = 'r^2=' + str(round(model.rsquared, 5))
# try to get the p-value -- some datasets with no data
# will cause an error, so just skip those
try:
p = ' p=' + str(round(model.pvalues[1], 3))
label = label + p
except:
pass
ax.plot(x, model.predict(X), color='red', label=label)
ax.scatter(x, y)
ax.set_title(title)
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.grid()
ax.legend()
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')
def plot_trades_vs_future_spy_price(trades, title):
fig, axs = plt.subplots(1, 4, figsize=(20, 4))
fig.suptitle('Trades vs. Future SPY Price % Change for ' + title)
set_subplot(axs[0], trades['est_amt'], trades['5_day_spy_pct_change'], '5 day', 'Net Value of Assets Bought/Sold', '5 Day % Change in S&P500')
set_subplot(axs[1], trades['num_purchases'], trades['5_day_spy_pct_change'], '5 day', 'Number of Purchases (Sales count as negative)', '5 Day % Change in S&P500')
set_subplot(axs[2], trades['est_amt'], trades['15_day_spy_pct_change'], '15 day', 'Net Value of Assets Bought/Sold', '15 Day % Change in S&P500')
set_subplot(axs[3], trades['num_purchases'], trades['15_day_spy_pct_change'], '15 day', 'Number of Purchases (Sales count as negative)', '15 Day % Change in S&P500')
plt.show()
for asset_type in aggregate_action_by_asset['asset_type'].unique():
trades_by_asset = aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == asset_type]
plot_trades_vs_future_spy_price(trades_by_asset, asset_type)
plot_trades_vs_future_spy_price(aggregate_action, 'Overall')
Bottom line: these linear regression models fit the data very poorly. Most of our R^2 values are extremely small, and p-values are large, indicating little to no correlation.
The highest R^2 value was ~.197, for the Other Securities 15 day Net Value of Assets Bought/Sold plot. This plot is mostly skewed by the one outlier trade worth about 15 million, when SPY tanked almost 30% over the next 15 days. As an aside: this seems like an unusually well timed trade. Let's take a look at this data point.
aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == 'Other Securities'].sort_values(by='est_amt').head()
df[(df['transaction_date'] == '2020-03-02') & (df['asset_type'] == 'Other Securities')]
Turns out this trade was by Senator Ron Johnson, who sold between \$5m and \$25m of assets before the COVID crash. He came under fire for trading on insider knowledge, but was investigated and eventually cleared.
We can safely conclude that there is little to no relationship between the entire senate's trading patterns and future SPY price.
However, I would imagine that not all senators would partake in insider trading. As a result, our data may be skewed by senators who don't insider trade. Let's now visualize the performance of each senator separately to try to find senators that routinely execute well timed trades. We will not be grouping the data by asset type for the sake of not having 50 * 6 plots, and instead only look at the trades of all asset types by each senator.
aggregate_action_by_senator_with_type = df.groupby([pd.Grouper(key='transaction_date', freq='D'), 'type', 'senator']).agg(est_amt=('est_amt', 'sum'), num_trades=('est_amt', 'count')).reset_index()
# because a sale is the opposite of a purchase, we want to show that
aggregate_action_by_senator = aggregate_action_by_senator_with_type.copy()
aggregate_action_by_senator['num_purchases'] = aggregate_action_by_senator.apply(lambda row: -row['num_trades'] if row['type'] == 'Sale' else row['num_trades'], axis=1)
aggregate_action_by_senator = aggregate_action_by_senator.groupby(['transaction_date', 'senator']).sum().reset_index()
aggregate_action_by_senator['5_day_spy_pct_change'] = aggregate_action_by_senator['transaction_date'].apply(lambda date: get_future_spy_price(5, date))
aggregate_action_by_senator['15_day_spy_pct_change'] = aggregate_action_by_senator['transaction_date'].apply(lambda date: get_future_spy_price(15, date))
for senator in aggregate_action_by_senator['senator'].unique():
trades_by_senator = aggregate_action_by_senator[aggregate_action_by_senator['senator'] == senator]
plot_trades_vs_future_spy_price(trades_by_senator, 'Senator ' + senator)
From these graphs, we can again find no evidence of consistent insider trades by any specific senator. Not considering senators with very few trades, R-squared values are generally very low and p-values are generally very high, indicating very little correlation between trades and future SPY price.
Senators with plots with p<=.05, indicating that their trade patterns have some significant relationship with future S&P price include:
Although some R^2 values are relatively high for some plots with positively sloping regression lines -- for example, Sen. Tammy Duckworth's 15 day plot of Net Value of Assets Bought/Sold has a regression line with R^2=.46 -- this does not indicate that these senators have well timed trades. Looking at Senator Duckworth's trades, nearly all of her trades have been sales, and the S&P500 has risen after almost all of her trades.
Our analysis has been inconclusive so far, but we can still take a different approach. The SP500 represents the aggregate performance of 500 different companies in the index. Although the stock market largely follows the same trends, individual stocks may react differently to world events. Thus, more insight may be gained if we analyze the performance of individual stock picks by senators. For example, David Purdue bought stock in PPE manufacturers just before COVID-19 panic started to spread in the U.S., prompting SEC investigation. Thus, to gain more insight, we should instead evaluate the performance of different stocks picked by senators, rather than the S&P500.
# returns a function that gets the spy price num_days from today
def get_future_stock_price(ticker, today, num_days):
try:
stock_history = yf.Ticker(ticker).history(period='5y')
# calculate current stock data
today_index = market_open_days.get_loc(today, method='nearest')
today_close = stock_history.iloc[stock_history.index.get_loc(today, method='nearest')]['Close']
# calculate future date and get stock data
future_index = today_index + num_days
future_date = market_open_days[future_index]
future_close = stock_history.iloc[stock_history.index.get_loc(future_date, method='nearest')]['Close']
return (future_close - today_close) / today_close
# yfinance sometimes cannot find certain tickers -- just return nan on error
except Exception as e:
return np.nan
stock_trades = df[df['ticker'] != '--']
# we're processing an insane amount of data here, so let's save this dataframe back to the disk
# so we can read it back easily
try:
stock_trades = pd.read_csv('stock_trading_performance.csv')
except FileNotFoundError:
print('calculating...')
stock_trades['5_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
print('done with 5 day')
stock_trades['15_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 15), axis=1)
print('done with 15 day')
stock_trades['30_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 30), axis=1)
print('done with 15 day')
# save this df to disk
stock_trades.to_csv('stock_trading_performance.csv')
stock_trades.head()
def set_individual_stock_subplot(ax, trades, percent_change_col, num_future_days):
x = trades['est_amt']
y = trades[percent_change_col]
# for some senators, the only data they have is missing, and cause issues with
# statsmodels.OLS
# let's just skip these for now
try:
# regression line
X = np.array(x).reshape(len(x), 1)
# add intercept term
X = sm.add_constant(X)
model = sm.OLS(y, X, missing='drop').fit()
ax.plot(x, model.predict(X), color='red', label='r^2=' + str(round(model.rsquared, 5)) + ' p=' + str(round(model.pvalues[1], 3)))
ax.legend()
except:
pass
ax.scatter(x, y)
ax.set_title(str(num_future_days) + ' Day')
ax.set_xlabel('Value of Stock Bought/Sold')
ax.set_ylabel(str(num_future_days) + ' Day % Change in Stock Price')
ax.grid()
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')
def plot_future_stock_performance(trades, senator):
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
fig.suptitle('Trade Value vs. Future Stock Price % Change for Senator ' + senator)
set_individual_stock_subplot(axs[0], trades, '5_day_pct_change', 5)
set_individual_stock_subplot(axs[1], trades, '15_day_pct_change', 15)
set_individual_stock_subplot(axs[2], trades, '30_day_pct_change', 30)
plt.show()
plot_future_stock_performance(stock_trades, 'Overall')
for senator in stock_trades['senator'].unique():
stock_trades_by_senator = stock_trades[stock_trades['senator'] == senator]
plot_future_stock_performance(stock_trades_by_senator, senator)
Disregarding senators with few trades, there is again little correlation between the value of stock bought or sold, and the future price of the stock for most senators, evidenced by the low R-squared values. However, there are some senators that may have trading patterns of interest. Some plots have p-values that are small, indicating we can reject the null hypothesis that the value of stock bought/sold is not a meaningful addition to our model. Senators with a p<=.05 include:
However, this does not indicate that these senators are insider trading. Out of these 7 senators, only 2 had a positive sloping regression line, meaning that the more of a stock they bought, the more likely the stock price went up in the near future. These two senators were Senator Christopher A. Coons, whose 5 day plot had a R-squared value of .568 and Senator Angus S. King, Jr, whose 30 day plot had a R-squared value of ~.26.
The other 5 senators had negative sloping regression lines, meaning that the more of a stock they bought, the more likely the stock price was to go down.
This analysis has found that there is little correlation between the senate's overall asset trades and near future stock prices, suggesting that the majority of senators do not consistently place well timed trades. While this is not set in stone, this also suggests that the majority (or all) of the senate does not partake in insider trading consistently.
Through analysis of individual senators' asset trading, we found senators whose trades had a significant correlation (positive correlations bolded) with near future S&P500 prices. This list includes:
Through analysis of individual trades, we have also found that there is also little correlation between most individual picks by each senator, and their future stock prices, suggesting that the majority of senators do not consisntely execute well timed trades. On the contrary, we found that there are many senators that instead consistently execute poorly timed trades. The two senators with relatively well timed trades include Senator Angus S. King Jr, and Senator Christopher A. Coons. The full list of senators whose individual stock picks had a significant correlation (positive correlations bolded) with near future stock prices include:
However, just because these senators' trades correlate to future stock prices, this is not evidence that these senators are using insider knowledge for their own gain. This stock market is unpredictable -- these senators might just be lucky (or unlucky), or are good traders but only use public information to trade. This analysis has simply produced a list of senators who execute relatively well timed trades.
This analysis aimed to find senators that consistently placed well timed trades. However, senators might also use insider knowledge for their own benefit only occasionally, which this analysis does not identify. In the future, we might analyze a senator's trading patterns, and use unsupervised machine learning techniques to identify unusual trades that might be based off insider knowledge.